Comparison of Genre-Based Tamil Songs Classification using Term Frequency and Inverse Document Frequency

 

Sundara Kanchana1*, Meenakshi.K2 and Velappa Ganapathy3

1Academic Administrator, School of Computing, SRM University, Kancheepuram-603203, Tamil Nadu, India

2Assistant Professor, School of Computing, SRM University, Kancheepuram-603203, Tamil Nadu, India

3Professor, School of Computing, SRM University, Kancheepuram-603203, Tamil Nadu, India

*Corresponding Author E-mail: kanchana.j@ktr.srmuniv.ac.in , meenakshi.k@ktr.srmuniv.ac.in, ganapathy.v@ktr.srmuniv.ac.in

 

ABSTRACT:

Songs can be classified into genres according to the mood, emotion and related attributes they evoke in the listener. Musical genre classification for Tamil lyrics is organized on three levels: base, mood and style. This paper compares genre-based classification of Tamil songs using Term Frequency and Inverse Document Frequency (tf-idf) scores. So far, this methodology has been applied only to English and Korean songs. The classifier establishes a relation between the features of the training samples and their categories. The frequency of word usage is captured by the term frequency and the inverse document frequency. The Support Vector Machine (SVM) and Naïve Bayes (NB) algorithms are run in the Weka classification tool, and their results on musical genre classification are compared. The experimental values obtained for various parameters show that the Naïve Bayes algorithm classifies the genres marginally better than the Support Vector Machine algorithm. We considered 2000 Tamil songs for testing the genre classification. At present two genres are considered; the same classification can be extended to other genres in future.

 

KEYWORDS:  tf–idf, genre, lyrics, Tamil songs, base, mood, style

 

 

 


INTRODUCTION:

As online and digital music collections grow, users want to access songs in a customized manner. They search the web for songs of their choice using features such as genre, style, lyricist, year of release, play count and mood. Genre classification is the process of grouping songs by defined similarities such as style, mood and emotion.

 

In particular, a recent study has revealed that users prefer to browse and search for songs by genre rather than by recommendation. So far, no research has addressed the selection or search of Tamil songs through genre classification using tf-idf. Another study has shown that genre is important to listeners and that style can influence their attraction to the Tamil language. Tamil is one of the classical languages and is very rich in morphology. Most Tamil words are strongly inflected; for example, Tamil verbs take different forms depending on the person, gender and number of the subject. The objective of the Tamil Morphological Analyzer tool is to retrieve the root from such an inflected form. Many different music styles exist all over the world, and music can be divided into genres in many ways based on the specific composition. These classifications are often arbitrary and controversial. Closely related styles may overlap, and music may be labeled with genres that are not specific to a culture, race or time period. Larger genres may contain more specific sub-genres.

 

Literature Survey:

In recent years, the tf-idf weighting scheme has been used for genre classification. In [1], a document is prepared by collecting and combining the lyrics of all English songs of a particular genre; such a document can be named, for example, the “Raggie Document”. Each document corresponds to a specific class, and the genre of a song is decided from features derived from these genre documents. Average genre classification results are calculated using the following weighting functions:

 

tf-idf, tf+tf-idf, wf-idf, lwf-idf and nlwf-idf,

 

where wf-idf is the product wf × idf, in which wf denotes the weight of the term (t) in the document (d), lwf-idf denotes linearly combined correlated term weights, and nlwf-idf denotes non-linearly combined correlated term weights. The English songs are classified using the Weka toolkit (Waikato Environment for Knowledge Analysis, a popular suite of machine learning software written in Java), and the classification accuracies obtained with the NB, SVM and KNN algorithms are compared. Around 1000 Tamil songs in the paadal portal were classified manually by applying the three-level genre classification system (Base, Mood and Style).

 

Ambiguities that arose in the manual classification process were resolved by assigning the most appropriate tag. Three levels of genre, namely Base, Mood and Style, are used to classify the songs. The ten Base classes are Character, Philosophy, Festival, Nature, Relationship, Romance, Occasion, Spiritual, Patriotic and Miscellaneous. The Mood (emotion) classes are Happy, Sad, Excited, Tender, Scared and Angry. The Styles are categorized as Traditional, Folk, Contemporary and Mixed [2]. In [3], classification is based on the emotions in the lyrics, which are extracted using partial syntactic analysis. Based on an existing emotion ontology, four different syntactic rules are applied to extract the emotions from the given lyrics. The Korean songs were classified using the Weka toolkit, and the accuracies obtained with the NB, SVM and HMM machine learning methods were compared [3]. Automatic genre classification using large high-level musical feature sets extracts 109 features from symbolic musical recordings. Classifications are made depending on instrumentation, texture, rhythm, dynamics, pitch statistics, method and chords. The classifications are performed using genetic algorithms and a supervised learning system, based on relative weightings. These systems make use of labeled training samples, and the classifier forms a relation between the features of the training samples and the related categories. Neural Networks (NN) and K-Nearest Neighbors (KNN) algorithms are used for classification [4].

 

The large number of channels and programs now available creates a need to help users organize and navigate the content of their choice. In [5], the problem of classifying television content using text information extracted from television programs is discussed. The classification is based on the Vector Space Model (VSM) and meta-feature approaches, which provided better results [5]. A user enjoys a song mostly in terms of its rhythm and pitch. Perceptual feature-based song genre classification therefore focuses on melody and rhythm and uses random sample consensus (RANSAC) as the classifier for better classification [6]. Audio features are combined with lyrics for classification, and Part Of Speech (POS) features are also used [7]. Musical features are automatically extracted from the audio signal, and classification is done based on psychological studies using a supervised learning approach [8]. Similar words can be identified by finding a feature vector for each word; using these vectors helps in understanding words beyond their syntactic or semantic regularities [9]. A novel machine learning approach is needed for analyzing social media text, where the data are increasing exponentially. In [10], machine learning algorithms such as SVM, the Maxent classifier, Decision Tree and Naive Bayes are used to classify Tamil movie customer reviews as positive or negative.

 

Tamil lyrics can be analyzed on the basis of word usage. In the conventional approach, audio signals are extracted from the songs to classify them by mood and emotion. In the present paper, the tf-idf score is used to classify the lyrics based on mood and style. The paper is organized as follows. Section 1 introduces genres and their importance. Section 2 discusses the proposed methodology. Section 3 explains the performance of the proposed classification. Finally, Section 4 concludes the paper by detailing the findings.

 

METHODOLOGY:

The proposed approach is explained using the architecture given in Figure 1. The raw dataset is prepared for the training phase by collecting the relevant data from web sites. The songs are separated mostly on the basis of emotion and mood features, and each song is placed in a folder according to its genre. Tokenization breaks the string of lyrics into tokens, using spaces, characters or strings as delimiters. This step is needed because the subsequent processing reads the lyrics word by word. Tokenization is carried out by a program written in Java. The dataset is then ready for the training phase.
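The tokenization step can be illustrated with a minimal sketch. The authors' program was written in Java and is not reproduced in the paper; the Python below is only an assumed equivalent, and the function name `tokenize` and the choice to strip ASCII punctuation are hypothetical.

```python
import string
from typing import List


def tokenize(lyric: str) -> List[str]:
    """Split a lyric string into word tokens (illustrative sketch only).

    Splitting is done on whitespace rather than with a regex word class,
    so that Tamil vowel signs (combining marks) stay attached to their
    base letters. ASCII punctuation at the token edges is stripped.
    """
    tokens = []
    for raw in lyric.split():
        token = raw.strip(string.punctuation)
        if token:  # skip tokens that were punctuation only
            tokens.append(token)
    return tokens
```

Calling `tokenize` on one line of a lyric returns the words in reading order, which is the form needed by the later stop word removal and stemming steps.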


 

Figure 1.   Architecture of proposed system

 


Stop words carry little meaning for classification, so they need to be removed. This is done by a stop word removal program written in Java. Morphological processing reduces words to their root form. For the tf-idf calculation, nouns are more significant than verbs. The tf-idf value calculated for each word denotes the importance of that word with respect to a particular genre. The genre-wise tf-idf values are given as input to the Weka tool for classification using the Naïve Bayes and Support Vector Machine algorithms. Based on the mean squared error and the number of correctly classified instances, we decide which algorithm gives better results.

 

Setup Phase:

There are three levels of genre, namely base, mood and style. Each of these levels and its sub-classes is explained below. To evaluate the performance of the proposed methodology, we collected Tamil songs categorized as Philosophy and Romance from the URLs http://www.tamilpaa.com/1692-aadadhada-aadadhada-tamil-songs-lyricsx.com, and http://list.ly/list/41R-tamil-philosophical-songs. Figures 2 and 3 show sample songs in text format for the Philosophy and Romance genres. Lyrics were tagged manually based on the three-level genre classification. For example, a song of the Philosophy genre can be tagged as:

 

 <genre> Philosophy.Sad.Traditional</genre> 

 

where Philosophy, Sad and Traditional denote the base, emotion and style of a lyric, respectively. There are ten base genres in Tamil songs: character, philosophy, festival, nature, relationship, romance, occasion, spiritual, patriotic and miscellaneous. The emotion classes are happy, sad, excited, tender, scared and angry. The styles are traditional, folk, contemporary and mixed. As a second example, a song of the Romance genre can be tagged as:

 

                                <genre> Romance.Happy.Folk</genre>                                                            

Here Romance, Happy and Folk denote the base, emotion and style of the lyric, respectively.
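The three-level tag format shown above can be read back into its parts with a small parser. This is a sketch under assumptions: the paper does not describe how the tags are processed, and the names `GenreTag` and `parse_genre_tag` are hypothetical.

```python
import re
from typing import NamedTuple


class GenreTag(NamedTuple):
    base: str   # e.g. Philosophy, Romance
    mood: str   # e.g. Sad, Happy
    style: str  # e.g. Traditional, Folk


# Matches <genre> Base.Mood.Style</genre>, allowing spaces inside the tag.
TAG_PATTERN = re.compile(
    r"<genre>\s*([^.<\s]+)\.([^.<\s]+)\.([^.<\s]+)\s*</genre>"
)


def parse_genre_tag(text: str) -> GenreTag:
    """Extract the base, mood and style from the first genre tag in text."""
    match = TAG_PATTERN.search(text)
    if match is None:
        raise ValueError("no <genre> tag of the form Base.Mood.Style found")
    return GenreTag(*match.groups())
```

With the two example tags from this section, the parser yields (Philosophy, Sad, Traditional) and (Romance, Happy, Folk); the base field is what decides which genre document a lyric is added to.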

 

Figure 2:  Sample songs in text format Philosophy.txt

 

Figure 3:  Sample songs in text format Romance.txt

 

Stemming Phase:

Stemming is the process of determining the root of each word in a lyric. The Morphological Analyzer is a computational tool used for analyzing Tamil lyrics; it carries out the stemming by analyzing each inflected word. An inflected word is formed from a root or stem combined with morphemes. Tamil is agglutinative and rich in morphology. The major inflectional categories in Tamil are nouns and verbs, and the noun morphology of Tamil is much simpler than its verb morphology. Figure 4 shows examples of the stemming process.

 

Figure 4. Stemming Process
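The idea of reducing an inflected noun to its root can be illustrated with a toy suffix stripper. This is not the Morphological Analyzer used in the paper, which handles far more than suffix removal. The suffix list, the romanized example words and the name `strip_suffix` are all assumptions made for illustration.

```python
from typing import Sequence

# Hypothetical, romanized noun suffixes (plural, accusative, locative).
# A real Tamil analyzer works on Tamil script and handles sandhi changes.
NOUN_SUFFIXES = ("gal", "ai", "il")


def strip_suffix(word: str,
                 suffixes: Sequence[str] = NOUN_SUFFIXES,
                 min_stem: int = 3) -> str:
    """Remove the longest matching suffix, keeping at least min_stem letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word  # no suffix matched: the word is treated as its own root
```

In this toy setting the inflected forms "paadalgal", "paadalai" and "paadalil" all reduce to "paadal" (song), so their occurrences are counted under a single term when tf-idf is computed.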

Stop word Removal phase:

Stop words are the most commonly used words in a language. They occur repeatedly in the lyrics and therefore affect the classification process. For example, the words in Figure 5 are used repeatedly in Tamil songs. These words are removed from the training set to ease classification.

 

Figure 5. Stop words
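Stop word removal amounts to filtering the token list against a fixed set. The sketch below assumes a small, hypothetical stop word set in romanized form; the actual list used by the authors is the one shown in Figure 5, and their program was written in Java.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical stop words for illustration only (romanized).
STOP_WORDS: FrozenSet[str] = frozenset({"oru", "indha", "antha"})


def remove_stop_words(tokens: Iterable[str],
                      stop_words: FrozenSet[str] = STOP_WORDS) -> List[str]:
    """Return the tokens that are not stop words, preserving their order."""
    return [token for token in tokens if token not in stop_words]
```

A set is used so that each membership check takes constant time, which matters when every token of every lyric is tested.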

 

Training Phase:

We combine the lyrics of one genre into a single rich text file, which can be named “Philosophy document”, “Romance document” and so on. Each document corresponds to a specific class.

 

tf-idf has the following two components:

tf - Term frequency and

idf - Inverse document frequency.

The importance of the term (t) in the document (d) is measured by the term frequency, which is given by

 

tf = n / h    (1)

 

where ‘n’ is the number of occurrences of the term in the document and ‘h’ is the total number of occurrences of all the terms in the document ‘d’, so that tf is the relative frequency of the term. The inverse document frequency idf is given by

 

idf = log ( N / df )    (2)

 

where ‘N’ is the total number of documents and ‘df’ is the number of documents in which the term occurs.

 

In the proposed approach, the term frequency is the frequency with which a term occurs within a particular genre document. A term with high frequency in a document is indicative of the genre of a song, while the same term may have a lower frequency in other genre documents, so the tf-idf value of the same term may differ from document to document. The tf-idf score of a term in a document is the product of its tf and idf values, and it is calculated for each text file (the Philosophy and Romance documents). We implemented the tf-idf calculation in Java. The output contains the tf-idf value of each term, which acts as input to the classification algorithms; the values are given in Figure 6.

 

Figure 6: Unicode of Tamil terms
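The tf-idf calculation of equations (1) and (2) can be sketched as follows. The authors implemented it in Java; this Python version is an assumed equivalent, not their code. It assumes that df counts the documents containing a term (the usual convention), that the score is the product tf × idf, and that the logarithm is natural, since the paper does not state the base. The function name `tf_idf` and the romanized example terms are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Compute tf-idf per term for each genre document.

    documents maps a genre name to the token list of its genre document.
    tf  = n / h         (occurrences of the term / all term occurrences)
    idf = log(N / df)   (N documents, df documents containing the term)
    """
    n_docs = len(documents)
    counts = {name: Counter(tokens) for name, tokens in documents.items()}

    # Document frequency: in how many genre documents each term appears.
    df: Counter = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    scores: Dict[str, Dict[str, float]] = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values())  # h in equation (1)
        scores[name] = {
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in term_counts.items()
        }
    return scores
```

With only two genre documents, a term that appears in both gets idf = log(2/2) = 0 and therefore a tf-idf of zero, while a term found in only one document keeps a positive score. This is why terms shared by Philosophy and Romance lyrics carry no weight in separating the two genres.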

 

Testing Phase:

For analysis in the Weka classification tool, the Tamil words are converted to Unicode. After conversion, a .csv file is generated that contains three features: the Unicode of the term, the category and the tf-idf value. Figure 7 gives the details of the .csv file. Each classifier is configured with 10-fold cross-validation. For example, if we give 2000 songs, the data are split into ten different data sets (folds); each fold is used once for testing while the remaining nine are used for training.

 

Figure 7: Format .csv file
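The conversion of terms to Unicode and the three-column .csv layout can be sketched as below. The exact encoding and column names used by the authors are not given in the text, so the `\uXXXX` escape form, the header names and the function names are assumptions.

```python
import csv
from typing import Iterable, TextIO, Tuple


def to_unicode_escape(term: str) -> str:
    """Encode each character of a term as a \\uXXXX code point escape."""
    return "".join("\\u{:04x}".format(ord(ch)) for ch in term)


def write_feature_csv(rows: Iterable[Tuple[str, str, float]],
                      stream: TextIO) -> None:
    """Write (term, category, tf-idf) rows in an assumed three-column layout."""
    writer = csv.writer(stream)
    writer.writerow(["unicode", "category", "tfidf"])  # assumed header names
    for term, category, score in rows:
        writer.writerow([to_unicode_escape(term), category,
                         "{:.6f}".format(score)])
```

Writing to any text stream, for example a file opened with `newline=""`, produces one line per term that a tool such as Weka can load through its CSV import.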

RESULTS AND DISCUSSIONS:

We used the Naïve Bayes and Support Vector Machine classification algorithms to carry out the experiments, and the performance results obtained are discussed below. The .csv file shown in Figure 7 has three attributes: Unicode, Category and tf-idf.

 

Tables 1 and 2 list the True Positive (TP) Rate, False Positive (FP) Rate, Precision and Recall values. 

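For a two-class problem, the TP Rate of one class and the FP Rate of the other class sum to one:

 

TP Rate (one class) + FP Rate (other class) = 1    (3)

 

The FP Rate is the rate of instances that are falsely classified as a given class, so every instance of one class that is not correctly classified is counted as a false positive of the other class.

For the Naïve Bayes classification algorithm, using (3),

1 + 0 = 1 and 0.987 + 0.013 = 1.

Similarly, for SVM,

0.99 + 0.01 = 1 and 0.987 + 0.013 = 1.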

 

Precision is the number of instances that truly belong to a class divided by the total number of instances classified as that class. Recall is the number of instances correctly classified as a given class divided by the actual number of instances of that class.
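These measures can be computed from the four cells of a per-class confusion matrix. The sketch below uses hypothetical counts, not the counts behind Tables 1 and 2, and the function name `class_metrics` is an assumption.

```python
from typing import Dict


def class_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Per-class measures from true/false positive and negative counts."""
    return {
        "tp_rate": tp / (tp + fn),    # share of the class found
        "fp_rate": fp / (fp + tn),    # share of the other class mislabeled
        "precision": tp / (tp + fp),  # share of predictions that are right
        "recall": tp / (tp + fn),     # identical to the TP rate
    }


# Hypothetical two-class example: 40 songs of class A (36 correct) and
# 60 songs of class B (57 correct).
metrics_a = class_metrics(tp=36, fp=3, fn=4, tn=57)
metrics_b = class_metrics(tp=57, fp=4, fn=3, tn=36)
```

In this example the TP rate of class A (0.9) and the FP rate of class B (0.1) sum to one, because every class A song that is missed is counted as a false positive of class B.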

 

Table 1: Evaluation measures for NB

Class         TP Rate   FP Rate   Precision   Recall
Philosophy    1         0.013     0.975       1
Romance       0.987     0         1           0.987

 

Table 2: Evaluation measures for SVM

Genre         TP Rate   FP Rate   Precision   Recall
Philosophy    0.99      0.013     0.975       0.99
Romance       0.987     0.01      0.995       0.987

 

We considered 568 instances for classification with both the Naïve Bayes and Support Vector Machine algorithms, and the results obtained are given in Table 3. Naïve Bayes classified 563 of the 568 instances correctly, whereas SVM classified 561 of the 568 instances correctly. The classification accuracy is therefore 563 / 568 = 99.1197% for NB and 561 / 568 = 98.7676% for SVM. We conclude that both algorithms classify the songs with high accuracy and that the Naïve Bayes algorithm performs marginally better than the SVM algorithm.

 

Table 3: Comparison of NB and SVM

Measure                            NB         SVM
Correctly Classified Instances     99.1197%   98.7676%
Incorrectly Classified Instances   0.8803%    1.2324%
Mean absolute error                0.0363     0.0123
Root mean squared error            0.0873     0.111
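The percentages of correctly and incorrectly classified instances follow directly from the instance counts reported in the text (563 and 561 correct out of 568). The short check below reproduces that arithmetic; the function name `percent` is hypothetical.

```python
def percent(part: int, total: int) -> float:
    """Return part/total as a percentage."""
    return 100.0 * part / total


TOTAL = 568                      # instances reported in the text
nb_correct, svm_correct = 563, 561

nb_accuracy = percent(nb_correct, TOTAL)             # about 99.1197
svm_accuracy = percent(svm_correct, TOTAL)           # about 98.7676
nb_error = percent(TOTAL - nb_correct, TOTAL)        # about 0.8803
svm_error = percent(TOTAL - svm_correct, TOTAL)      # about 1.2324
```

The difference between the two classifiers is two instances out of 568, which is why the advantage of Naïve Bayes is described as marginal.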

 

CONCLUSION:

Tamil musical genre classification is organized on three levels: base, mood and style. This paper compared genre-based classification of Tamil songs using Term Frequency and Inverse Document Frequency (tf-idf) scores. The raw dataset was prepared for the training phase by collecting the relevant data from web sites. To support easy identification of Tamil songs by listeners, we considered two algorithms, Naïve Bayes (NB) and Support Vector Machine (SVM), ran both in the Weka tool, and compared their performance on several evaluation measures. The experimental results show that, for the two genres Philosophy and Romance, the Naïve Bayes algorithm classified marginally better than the Support Vector Machine algorithm: the classification accuracy is 99.1197% for NB and 98.7676% for SVM. This paper considered only two genres; the work may be extended to include the remaining genres. In future, the proposed methodology may also be tested with other classification algorithms.

 

REFERENCES:

1.        Teh chao Ying, Shyamala Doraisamy, Lili Nurliyana  Abdullah, University Putra Malaysia, “Lyrics-Based Genre Classification Using Variant tf–idf weighing Schemes”, 2015, Vol 15, pp.289– 294.

2.        Karthikeyan, Nandini Karky, Elanchezhiyan, Rajapandian and MadhanKarky, “A Three-level Genre Classification for Tamil Lyrics”, Karky Research Foundation, Chennai, India, May 2013, pp.79– 83.

3.        Minho Kim, Hyuk-Chul Kwon. Dept. of Computer Science, Busan, Republic of Korea, “Lyrics-Based Emotion Classification using  Feature selection by Partial Syntactic Analysis”, DOI 10.1109/ICTAI.2011.165, pp.960–964

4.        Cory McKay, Ichiro Fujinaga, McGill University, “Automatic Genre Classification using large high level musical  feature sets”, 2014,Nov,19, pp. 525-530.

5.        Humberto Corona, Michael, P. O'Mahony, University College Dublin, Ireland, “A Mood-based Genre Classification of Television  Content”, Aug, 2015.

6.        Arijit Ghosal, Rudrasis Chakraborty, Bibhas Chandra Dhara, Sanjoy Kumar Saha, “Perceptual feature-based song genre classification using RANSAC”, International Journal of Computational Intelligence Studies, 2015, Vol 4, pp.31–49.

7.        Teh Chao Ying, Shyamala Doraisamy, Lili Nurliyana Abdullah, UPM, Serdang, Malaysia, “Genre and Mood classification using lyric features”, 2012,30, May, pp.260–263.

8.        Laurier, C., Meyers, O., Serrà, J. et al., “Indexing Music by Mood: Design and integration of an automatic content-based annotator”, Multimedia Tools and Application 2010, May, pp. 161-184.

9.        S. G. Ajay*, M. Srikanth, M. Anand Kumar, K. P. Soman, “Embedding Models for Finding Semantic Relationship between Words in Tamil Language”, Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Amrita University, Coimbatore - 641112, Tamil Nadu, India, DOI: 10.17485/ijst/2016/v9i45/106478.

10.     Shriya Se*, R. Vinayakumar, M. Anand Kumar, K. P. Soman, “Predicting the Sentimental Reviews in Tamil Movie using Machine Learning Algorithms”, Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore Amrita Vishwa Vidyapetham, Amrita University, Coimbatore – 641112, Tamil Nadu, India, DOI: 10.17485/ijst/2016/v9i45/106482, December 2016.

 

 

 

 

 

Received on 17.03.2017             Modified on 26.03.2017

Accepted on 06.04.2017           © RJPT All right reserved

Research J. Pharm. and Tech. 2017; 10(5):1449-1454.

DOI: 10.5958/0974-360X.2017.00256.6